ABSTRACT

The immense increase in availability of genomic scale 
datasets, such as those provided by the ENCODE and 
Roadmap Epigenomics projects, presents unprece- 
dented opportunities for individual researchers to 
pose novel falsifiable biological questions. With this 
opportunity, however, researchers are faced with the 
challenge of how to best analyze and interpret their 
genome-scale datasets. A powerful way of representing
genome-scale data is as feature-specific coordinates
relative to reference genome assemblies, i.e. as
genomic tracks. The Genomic HyperBrowser
(http://hyperbrowser.uio.no) is an open-ended web server
for the analysis of genomic track data. By providing
several highly customizable components for processing
and statistical analysis of genomic tracks, the
HyperBrowser enables a range of genomic investigations
related to, e.g., gene regulation, disease association
or epigenetic modifications of the genome.



INTRODUCTION 

The immense increase in the production of genomic scale 
datasets, e.g., through the ENCODE (1) and Roadmap 
Epigenomics (2) projects, poses an unmet challenge in 
terms of available methodology and tools for analytic 
investigations. These datasets provide unprecedented 
opportunities for individual researchers to elucidate
particular biological mechanisms. However, analysis of
these datasets and their relations to each other typically
requires development of a range of ad hoc scripts for
generating, manipulating and analyzing genomic data.

For a range of organisms, well-established and
internationally accepted reference genome assemblies now
exist. Using coordinates on such assemblies, data related
to particular locations on the genome can be represented
in a precise and unambiguous manner. This avoids many
previous difficulties in the field, such as confusion due to
incompatible gene terminology. A genome-wide collection
of coordinates for a particular genomic feature is often
referred to as a genome annotation track, or just
genomic track. Such genomic tracks can, e.g., refer to
the location of genes, binding of transcription factors,
methylation of DNA or modification of histones.
Genomic tracks not only allow unified visualization and
browsing, such as through the UCSC Genome Browser
(3), but also provide a powerful and unified basis for
statistical analysis. The base pair positions of reference
genomes serve as coordinates on a line, allowing entities
such as genes or epigenetic modifications to be viewed as
elements positioned on such a line. A statistical question,
posed on the relation between two genome-scale datasets,
may then be formulated as a simple question relating such
elements. An example is to ask whether points on a
reference line as defined by one dataset fall unexpectedly
often within segments on the same line as defined by
another dataset.
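The points-within-segments question above can be made concrete with a minimal sketch. It assumes a single chromosome, half-open segments [start, end) that are sorted and non-overlapping, and hypothetical names (count_points_in_segments, the toy coordinates); it is an illustration, not the HyperBrowser implementation.

```python
from bisect import bisect_right


def count_points_in_segments(points, segments):
    """Count points that fall inside any half-open segment [start, end).

    Assumes `segments` is sorted by start and non-overlapping.
    """
    starts = [start for start, _ in segments]
    inside = 0
    for p in points:
        # Index of the last segment that starts at or before p.
        i = bisect_right(starts, p) - 1
        if i >= 0 and p < segments[i][1]:
            inside += 1
    return inside


# Toy data on a line of length 100 (hypothetical coordinates).
points = [5, 12, 40, 77, 90]
segments = [(0, 10), (30, 50), (85, 95)]
observed = count_points_in_segments(points, segments)

# Under uniform placement, the expected count is the number of points
# times the fraction of the line covered by segments.
covered = sum(end - start for start, end in segments)
expected = len(points) * covered / 100
```

Comparing the observed count with the expected count gives a descriptive view; whether the difference is "unexpected" depends on the null model used.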

The Genomic HyperBrowser web server provides a 
broad suite of functionality for rigorous statistical 
analysis of genomic data. At the core of the system is a 
set of statistical analyses, available through a single tool: 
'Analyze genomic tracks'. Descriptive statistics, test stat- 
istics and null models are described in terms of well- 
defined elements along a linear representation of the 
genome, in the form of genomic tracks. This tool and its
underlying methodology have been described in a previous
publication (4), and the tool has since been expanded with
tens of new descriptive analyses and hypothesis tests. The
statistical analysis is augmented by a collection of data
preparation tools that support the processing of genomic
data into forms that subsequently allow sophisticated
questions to be posed in a simple and intuitive manner. All
42 tools at the server are based on the generic treatment of
genomic data as elements along a linear representation
of the genome, allowing questions related to different
biological application domains to be treated in the same
manner. The tools share an underlying analysis code
base, which is open-source and tightly integrated with 
the Galaxy framework (5) for handling of web access, 
users and data. Through the integration with Galaxy, 
the standard Galaxy tools are also available and can be 
used together with the HyperBrowser-specific functional- 
ity. The HyperBrowser website is free and open to all, and 
there is no login requirement. 

The Genomic HyperBrowser is designed to be as open- 
ended as possible: instead of being developed around a few 
canonical usage scenarios, it provides a core set of abstrac- 
tions and components that can be used and combined in a 
myriad of ways to answer precisely formulated biological 
questions. Figure 1 gives a schematic overview of how 
various tools at the HyperBrowser server can be used as 
part of a full analysis scenario. 

ANALYSIS OF GENOMIC TRACKS 

A large collection of analytical functionality is available
through the tool 'Analyze genomic tracks' under the
'HyperBrowser analysis' menu. This enables a range
of genomic investigations that query characteristics of
individual tracks or relations between pairs of tracks along
the genome (4). After selecting one or a pair of tracks, the
analysis of interest can be selected among a set of analyses
deemed meaningful based on the type of track(s) selected.
For instance, selecting two tracks of segments (intervals)
along the genome (e.g. two tracks of ChIP-seq peak
regions, without any values associated with the peaks)
will allow questions related to co-localization (overlap).
On the other hand, selecting two tracks of values per
base pair along the genome (e.g. two tracks of bp-level
ChIP-seq signal values for every position of the genome)
will allow questions related to correlation of values. The
HyperBrowser system distinguishes between 15 types of
tracks at the generic level (6), where the most widespread
types are tracks of points and segments.
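To illustrate how the track type determines the meaningful question, the sketch below computes base pair overlap for two segment tracks and Pearson's correlation for two value tracks. The function names and the single-chromosome, half-open coordinate convention are assumptions made for this illustration only.

```python
from math import sqrt


def overlap_bp(track1, track2):
    """Base pairs covered by both tracks.

    Each track is a sorted list of non-overlapping half-open
    (start, end) segments on the same chromosome.
    """
    i = j = total = 0
    while i < len(track1) and j < len(track2):
        start = max(track1[i][0], track2[j][0])
        end = min(track1[i][1], track2[j][1])
        if start < end:
            total += end - start
        # Advance whichever segment ends first.
        if track1[i][1] < track2[j][1]:
            i += 1
        else:
            j += 1
    return total


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Segment tracks lead naturally to overlap-based statistics, whereas per-base-pair value tracks lead to correlation-based statistics; neither function would be meaningful on the other kind of input.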

Analyses are divided into descriptive statistics (such as
counts, base pair coverage and averages) and hypothesis
tests (such as whether two tracks are overlapping more
than expected by chance). A total of 56 descriptive
statistics and 20 hypothesis tests are available, depending
on the type of tracks (listed in Table 1). Each hypothesis
test may be seen as a generic genomic question that can be
parameterized in several ways. The statistical testing
procedure used to resolve the question not only varies
between questions, but also between parameterizations.
One parameterization is the selection of an appropriate
null model. Statistical hypothesis testing requires a
notion of randomness for the null hypothesis, and
careful attention has been given to making such
randomness assumptions transparent to the user. For most
tests, the randomness assumptions can also be selected from
a list of possibly meaningful alternatives (Figure 2A). For
instance, for hypothesis tests involving a gene track, one
can choose a simple null model where genes are
randomized independently and uniformly along the
genome. Alternatively, one can select a null model where
the empirically observed clustering tendency of genes
(distribution of inter-gene distances) is preserved. A further
alternative is to sample gene positions according to a
separately specified intensity track, which can for instance
be used to control for influence by external confounders.
Depending on the assumptions deemed appropriate by
the user for the hypothesis test (through, e.g., the selection
of a null model), the system will determine whether to use
either an asymptotic computation or a Monte Carlo (MC)
based evaluation of P-values. This is handled by the
system, but at the same time transparent to the user.
For MC-based evaluation of P-values, a sequential
sampling scheme, MCFDR, is used to automatically
determine the appropriate number of samples for
statistical testing (9).
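As a rough illustration of MC-based P-value evaluation, the following sketch tests whether points fall inside segments more than expected under the simplest null model mentioned above (points randomized independently and uniformly, segments preserved). It uses a fixed number of samples rather than the sequential MCFDR scheme, and all names (count_inside, mc_p_value, genome_length) are hypothetical.

```python
import random
from bisect import bisect_right


def count_inside(points, segments):
    """Number of points inside sorted, non-overlapping [start, end) segments."""
    starts = [start for start, _ in segments]
    n = 0
    for p in points:
        i = bisect_right(starts, p) - 1
        if i >= 0 and p < segments[i][1]:
            n += 1
    return n


def mc_p_value(points, segments, genome_length, n_samples=999, seed=0):
    """One-sided Monte Carlo P-value for 'more points inside than expected'.

    Null model (an assumption of this sketch): points are placed
    independently and uniformly along the line; segments stay fixed.
    """
    rng = random.Random(seed)
    observed = count_inside(points, segments)
    extreme = 0
    for _ in range(n_samples):
        randomized = [rng.randrange(genome_length) for _ in points]
        if count_inside(randomized, segments) >= observed:
            extreme += 1
    # The add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_samples + 1)
```

Swapping the line that builds `randomized` for another sampling procedure (for instance one that preserves inter-point distances) changes the null model without changing the rest of the test, which mirrors the way the null model is treated as a selectable parameter.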

The output of the 'Analyze genomic tracks' tool 
(Figure 2B) presents the main conclusion from the 
analysis, along with some interpretations and restrictions 
on its applicability. This main conclusion is complemented 
by a range of detailed results in the form of tables and 
figures, provided at both the global level and for local 
regions along the genome. The tool emphasizes reprodu- 
cibility by providing rich analysis output, describing the 
methodologies that have been used, and reporting all par- 
ameter settings and data sources. Screencasts, tutorials 
and demo buttons for five genome analysis examples are 
provided with the tool. 

A set of tools focusing on visual analysis of track data is
available under the menu 'Visual analysis of tracks'.
Under the menu 'Specialized analysis of tracks', we
provide a tool containing a recently developed hypothesis
test querying whether the elements of a track are spatially
co-localized with respect to the three-dimensional
structure of the genome, as defined using results from recent
Hi-C experiments (10). A tool for unsupervised analysis of
track similarities (clustering) is also available under the
same heading (manuscript submitted). Tool details are
given in Table 2.

Figure 1. Schematic overview of tool categories available at the
Genomic HyperBrowser server. The figure indicates at which points
of a typical analysis scenario the various tools may be of use, from
the initial collection and preparation of data, through customization of
data to match the analysis, to the statistical evaluation of a biological
hypothesis. For boxes representing several tools, the precise list of tools
can be found under the corresponding header in the table that is
referred to (for instance, the two tools represented by the 'Format
and convert' box can be found under the heading 'Format and
convert tracks' of Table 3).



PROCESSING DATA INTO FORMS SUITABLE FOR 
ANALYSIS 

In many situations, a complex formulation of a biological
question may be simplified if the original data are first
transformed into a form that more directly reflects the
question of interest. An example of this is a question of
how often DNA binding locations of a given TF (as a first
genomic track) fall inside or in the close vicinity of genes
(as a second track). Although clearly manageable, the
concept of proximity in this setting requires some
thought and further specification. If one transforms the
gene track by expanding the gene intervals to include,
say, 1 kbp flanks, one can afterwards ask the simpler
question of how often the TF binding locations
fall inside these expanded gene intervals. This latter
version is easy to envision and does not involve any
ambiguity. This example shows how a problem originally
formulated in terms of vicinity can be redefined to fit an
analysis based on the simpler concept of containment.
Thus, by combining a set of basic, generic analyses with
a collection of track transformation functionality, a core
set of well-understood analyses can be applied to a much
broader range of biologically motivated questions. Several
tools for customizing data into forms that may simplify
subsequent analyses are available under the menu
'Customize tracks', and are summarized in Table 3.
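The flank-expansion transformation described above can be sketched as follows. Clipping at the chromosome ends and merging of segments that overlap after expansion are assumptions of this sketch, not documented behavior of the 'Expand BED segments' tool; the function name and coordinates are hypothetical.

```python
def expand_segments(segments, flank, chrom_length):
    """Expand half-open (start, end) segments by `flank` bp on both sides.

    Expanded segments are clipped to [0, chrom_length) and merged when
    they overlap or touch after expansion.
    """
    expanded = sorted(
        (max(0, start - flank), min(chrom_length, end + flank))
        for start, end in segments
    )
    merged = []
    for start, end in expanded:
        if merged and start <= merged[-1][1]:
            # Extend the previous segment instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Genes with 1 kbp flanks on a toy chromosome of length 10 000.
genes = [(500, 1500), (3000, 4000), (9500, 9900)]
genes_with_flanks = expand_segments(genes, 1000, 10000)
```

After this step, the vicinity question reduces to counting TF binding locations inside `genes_with_flanks`, i.e. to a plain containment analysis.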

In some analysis scenarios, a feature of interest is not 
explicitly available in the form of a genomic track, but can 
be derived from properties of other genomic tracks. The 
HyperBrowser menu 'Generate tracks' includes several 
tools for generation of datasets in such situations. 
Tracks can be generated based on DNA sequence 
properties along the genome, or based on density of, or 
distance to, certain genomic features along the genome. 
An overview of these tools is given in Table 3. 
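As an illustration of deriving a track from another track, the sketch below builds a bp-level track of distance to the nearest segment using two linear passes. It assumes a single small chromosome held in memory as a Python list, which is only practical for toy examples; the function name is hypothetical.

```python
def distance_to_nearest_segment(segments, chrom_length):
    """Return d, where d[i] is the distance in bp from position i to the
    nearest half-open segment [start, end); positions inside a segment get 0."""
    dist = [float("inf")] * chrom_length
    for start, end in segments:
        for i in range(start, end):
            dist[i] = 0
    # Forward pass: distance to the nearest covered position on the left.
    for i in range(1, chrom_length):
        dist[i] = min(dist[i], dist[i - 1] + 1)
    # Backward pass: also consider covered positions on the right.
    for i in range(chrom_length - 2, -1, -1):
        dist[i] = min(dist[i], dist[i + 1] + 1)
    return dist
```

The result is a function-type track (one value per base pair) that can then be related to point or segment tracks, for example by taking the mean value at the points of another track.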

In other analysis scenarios, genomic coordinates are 
available for the data of interest, but not in a format 
that can be readily used in the tool of interest. 
Genomic datasets come in a variety of forms, including 
raw lists of coordinates not adhering to any specified 
format. The data are usually in tabular format, typically
as raw text files or as spreadsheet documents. The
HyperBrowser recognizes most commonly used tabular
formats, in addition to a recent unified format, GTrack,
supporting all 15 basic types of tracks handled by the
system. A format conversion tool is available under the
menu 'Format and convert tracks', alongside a tool for
structuring raw tabular data into a GTrack file (Table 3).
A set of tools for validating and editing GTrack files is
also available, as introduced in (6).
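The kind of restructuring involved can be sketched for the simpler case of BED output, since BED uses 0-based, half-open coordinates. The sketch assumes tab-separated input in which the caller knows which columns hold chromosome, start and end; the function name and the example rows are hypothetical, and the GTrack format itself is not reproduced here.

```python
import csv
import io


def table_to_bed(text, chrom_col, start_col, end_col, one_based=True):
    """Convert tab-separated rows to BED lines (0-based, half-open).

    Rows whose coordinate fields are not integers, such as a header
    row, are skipped.
    """
    bed_lines = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        try:
            start, end = int(row[start_col]), int(row[end_col])
        except (ValueError, IndexError):
            continue
        if one_based:
            # 1-based inclusive start becomes 0-based start; end is unchanged.
            start -= 1
        bed_lines.append(f"{row[chrom_col]}\t{start}\t{end}")
    return bed_lines


raw = "name\tchrom\tstart\tend\nsiteA\tchr1\t101\t200\nsiteB\tchr2\t5\t5\n"
bed = table_to_bed(raw, chrom_col=1, start_col=2, end_col=3)
```

Making the coordinate convention an explicit parameter reflects the main pitfall with raw coordinate lists: they rarely state whether positions are 0-based or 1-based.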
Table 1. Selected descriptive statistics and hypothesis tests available through the 'Analyze genomic tracks' tool of the Genomic HyperBrowser

Track1 type | Track2 type | Statistical investigation | Description

Descriptive statistics
P | - | Counts | The number of track1-points
P | - | Frequency | The frequency of track1-points
P | - | Mean and variance of gaps | Mean and variance of gaps between track1-points
P | P | Frequency proportion | The proportion of all points (track1 and track2) arising from track1
P | P | Point distances | The distribution of distances from each track1-point to the nearest track2-point
P | S | Count inside/outside | The number and proportion of track1-points inside and outside track2-segments
P | S | Matrix of count inside | The number of track1-points inside track2-segments, for all combinations of categories from both tracks
P | S | Relative position within segments | The average relative position of track1-points within track2-segments
P | S | Point to segment distances | The distribution of distances from each track1-point to the nearest track2-segment
S | - | Bp coverage | The number of base pairs covered by track1
S | - | Proportional coverage | The proportion of total base pairs covered by track1
S | - | Avg. segment length | The average length of segments of track1
S | - | Segment lengths | The distribution of lengths of each track1-segment
S | S | Coverage | Base pair and proportional coverage by track1, track2 and by both
S | S | Enrichment | The enrichment of track1 inside track2 and vice versa, at the bp level
S | S | Segment distances | The distribution of distances from each track1-segment to the nearest track2-segment
F | - | Mean | The mean value of track1
F | - | Sum | The sum of values of track1
F | - | Variance | The variance of values of track1
F | - | Min and max | The extreme values (min/max) of track1
F | P | Mean at points | The mean value of track1 at positions of track2
F | S | Mean inside and outside | The mean value of track1 inside track2 and outside track2
F | F | CC | Pearson's correlation coefficient of track1 and track2
VP | - | Values | The distribution of values of track1-elements
VP | S | Values inside | The distribution of values of track1-elements inside track2-elements
VS (c/c) | P | Inside case versus control | The number of track2-points that fall inside track1-segments marked as case or control
VP (c/c) | VS (c/c) | Two-by-two table of inside | Two-by-two table of case/control track1-points that fall inside case/control track2-segments
VS (cat) | - | Category bp coverage | The number of base pairs covered by each category of track1
VS (cat) | - | Category point count | The number of elements of each category of track1
VP (cat) | VS (cat) | Contingency table of inside | Contingency table of categorical track1-points that fall inside categorical track2-segments
L | - | Number of nodes and edges | The number of nodes and edges in track1
L | - | Number of neighbors | The distribution of the number of neighbors for each node in the graph (track1)
L (w) | - | Edge weights | The distribution of weights for each edge of the graph (track1)
L (w) | - | Clustered heatmap of graph | Clustered heatmap of weights of the graph (track1)

Hypothesis tests
P | P | Different frequencies? | Where is the relative frequency of points of track1 different from the relative frequency of points of track2, more than expected by chance?
P | P | Located nearby? | Are the points of track1 closer to the points of track2 than expected by chance?
P | S | Located inside? | Are the points of track1 falling inside the segments of track2, more than expected by chance?
P | S | Located non-uniformly inside? | Do the points of track1 tend to accumulate more toward the borders of the segments of track2?
P | S | Located nearby? | Are the points of track1 closer to the segments of track2 than expected by chance?
S | S | Similar segments? | Are track1-segments similar (in position and length) to track2-segments, more than expected by chance?
S | S | Overlap? | Are the segments of track1 overlapping the segments of track2, more than expected by chance?
S | S | Located nearby? | Are the segments of track1 closer to the segments of track2 than expected by chance?
F | F | Correlated? | Are the values of track1 and track2 more positively correlated than expected by chance?
P | F | Higher values at locations? | Are the values of track2 higher at the points of track1, than what is expected by chance?
S | F | Higher values inside? | Are the values of track2 higher inside the segments of track1, than what is expected by chance?
P | VS | Located in segments with high values? | Does the number of track1-points that fall in track2-segments depend on the value of track2-segments?
VP | VS | Higher values inside segments? | Do the points of track2 that occur inside segments of track1 have higher values than points occurring outside the segments of track1?
VP | VP | Nearby values similar? | When track1-points and track2-points are nearby each other, are the values more similar than expected by chance?
P | VS (c/c) | Located in case segments? | Does the number of track1-points that fall in track2-segments depend on whether the track2-segments are marked as case or control?
VS (c/c) | S | Preferential overlap? | Are the segments of track1 marked as case overlapping unexpectedly more with the segments of track2 than the segments of track1 marked as control?
VP (cat) | VS (cat) | Category pairs differentially co-located? | Which categories of track1-points fall more inside which categories of track2-segments?
LGP | P | Co-localized in 3D? | Are the points of track2 closer in 3D (as defined by track1) than expected by chance?

Each analysis is defined for either one or two tracks, with the corresponding track type denoted in the columns 'Track1 type' and 'Track2 type'. The
track type abbreviations, as defined in (6), are as follows: Points (P), Segments (S), Valued Points (VP), Valued Segments (VS), Function (F), Linked
Genome Partition (LGP) and any Linked (L) track. In addition, attached values are: number (default), case/control (c/c), category (cat) and weighted
edges (w). Most hypothesis tests are available in one- and two-sided versions. Looking at, e.g., overlap, the possible alternative hypotheses would
then be whether the segments of track1 are overlapping the segments of track2, more, less or differently than expected by chance. Results of the
analyses are given both at the global level and for local regions along the genome. A few of the hypothesis tests relating points and/or segments are
also available in specific libraries (7,8), but only for certain null models. In addition, these libraries require low-level command-line access, API access
or configuration file setup in order to start analyses.



Figure 2. Screenshots of the web interface and results page for the 'Analyze genomic tracks' tool. (A) Input data, analyses of interest, and analysis
parameters are precisely specified through a set of selection boxes. (B) The result page provides a main conclusion from the statistical test, as well as
a range of details that can be inspected by following various links from the main results page.
Table 2. Tools for statistical, visual and specialized analyses of genomic tracks

Tool name | Description | Genomic example

Statistical analysis
Analyze genomic tracks | The main analysis interface of the Genomic HyperBrowser (4). Executes analyses on a single genomic track or on the relation between two tracks. Allows specification of additional input parameters for the analyses, specifically including the specification of alternative hypotheses and null models for the hypothesis tests. Contains 56 descriptive statistics and 20 hypothesis tests. | Analyze cell-specificity of active chromatin in disease regions, as described in section 'Full analysis scenario'.

Visual analysis of tracks
Visualize track elements relative to anchor regions | Allows visualization of the distribution of track elements along chromosomes, or along custom-specified bins. The specified regions are displayed vertically, in order to simplify visual comparison. | Visualize the detailed positioning of histone modifications relative to the TSS of a selected set of gene regions.
Create high-resolution map of track distribution along genome | Visualizing track elements along a line, such as in the UCSC genome browser or the relative positioning visualization tool, can necessarily only offer a global overview at a very limited resolution. This tool instead uses a fractal layout of the genome line (similar to a Hilbert curve (11)) to map genome locations to individual pixels in a matrix instead of along a line, effectively increasing the resolution quadratically. Although the interpretation requires a certain effort, this form of visualization can potentially be very informative. | Visualize the genome-wide distribution of a densely populated track, such as repeating elements or a DNase accessibility experiment.
Create high-resolution map of multiple track distributions along genome | Similar to the one-track version above, but uses up to three separate color channels (red, green, blue) to visualize the presence of up to three different tracks in corresponding parts of the genome by combining their color channel values at individual pixels. | Visualize the comparative distribution of DNase accessibility in three different cell types to see patterns of similar and distinct accessibility.
Visualize relation between two tracks across genomic regions | Used to reveal complex relations between tracks along the genome. For each defined analysis region (bin), a score is calculated for both tracks, using the specified summarizing function. The resulting (x,y) scores are then visualized as a single point in a scatter plot. | Plot exon density versus average melting temperature in 10 Mbp bins along the genome.
Aggregation plot of track elements relative to anchor regions | Used to reveal trends of how track elements are distributed relative to a set of anchor regions (bins). All anchor regions are divided into the same number of sub-bins, and a summary statistic is calculated for each sub-bin and averaged across all anchor regions. The tool returns a plot of the average values with 95% confidence intervals. | Positions of histone modifications around TSS.

Specialized analysis of tracks
Analyze co-localization of input genomic regions | Analyze a selected track of genome locations for spatial co-localization with respect to the three-dimensional structure of the genome, as defined using results from recent Hi-C experiments. The Hi-C data have been corrected for bias using a method presented in a recent paper (10), and further normalized by subtracting the expected signal given the sequential distance between elements. | Analyze whether somatic mutations in cancer are co-localized in 3D in a relevant cell type.
Perform clustering of genomic tracks | Used to investigate relations between multiple tracks in an unsupervised manner (manuscript submitted). This tool allows an essentially unlimited number of tracks to be selected, and further allows the distance measure to be used for the clustering to be precisely specified through selection among a varied set of notions of track similarity. | Analyze similarities between histone modifications in different cell types.
Analyze k-mer occurrences | Used to analyze a global track of occurrence locations for a specified k-mer from a particular reference genome. All relevant analyses in the 'Analyze genomic tracks' tool can be used. | Analyze correlation of a specific k-mer with other tracks, e.g. genes, in order to find functional significance.
Inspect k-mer frequency variation | Used to calculate and visualize the frequency distribution of a particular k-mer along a genome reference. Splits the selected analysis regions (e.g. chromosomes) into a suitable number of subregions (bins). For each bin, the number of occurrences of the selected k-mer is counted and plotted. | Inspect the frequency variation of a particular k-mer along the genome.

Further descriptions are given at the web pages of the tools themselves, along with demo buttons and links to reproducible examples of how each
tool can be used. The 'Analyze genomic tracks' tool has previously been described (4).
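The high-resolution map tools in Table 2 are described as using a fractal layout similar to a Hilbert curve. The sketch below uses the standard Hilbert curve index-to-coordinate conversion to show how positions along a line can be mapped to pixels so that neighboring bins stay adjacent; it is not necessarily the layout used by the tools, and the function names and parameters are hypothetical.

```python
def hilbert_d2xy(order, d):
    """Map index d in [0, 4**order) to (x, y) on a 2**order by 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the quadrant so sub-curves connect end to end.
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def genome_to_pixel(position, genome_length, order):
    """Assign a 0-based genome position to a pixel; each pixel covers one
    contiguous bin of the genome line."""
    n_pixels = 4 ** order
    d = position * n_pixels // genome_length
    return hilbert_d2xy(order, d)
```

With a grid of 2**order pixels per side, the number of bins grows with the square of the image width, which is the quadratic gain in resolution mentioned in the table.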
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Table 3. Tools for extracting genomic tracks from the HyperBrowser repository, customizing tracks into forms suitable for a subsequent analysis 
of interest, generating new tracks, and formatting and converting existing tracks 



Tool name 



Description 



Genomic example 



HyperBrowser track repository 
Extract track from 
HyperBrowser repository 



Customize tracks 

Expand BED segments 



Combine two BED files 
into single case-control 
track 



Merge multiple BED files 
into single categorical track 



Generate tracks 

Generate bp-level track 
from DNA sequence 



Generate bp-level track of 
distance to nearest segment 

Generate intensity track for 
confounder handling 



Generate k-mer occurrence 
track 

Generate track of genes 
associated with literature 
terms (using Coremine) 



Format and convert tracks 

Convert between GTrack/BED/ 
WIG/bedGraph/GFF/ 
FASTA files 



Create GTrack file from 
unstructured tabular data 



Used to extract datasets from the track repository stored on the 
HyperBrowser server. Datasets can be extracted in a range of 
different formats, and from limited regions of the genome, if 
needed. Also, overlapping segments can be merged. 

Allows extracting start-, mid- or endpoints of genomic intervals, as 
well as expanding either the original intervals or the extracted 
start-/end-/mid-points. This is useful in a variety of situations 
where an analysis of interest involves either proximity to or pos- 
itioning relative to the original track elements, or where a size 
unification of track elements is desired (based on, e.g., taking 
midpoints and then expanding a certain distance). Also, if the 
expanded region crosses any chromosome borders, this is 
handled correctly. 

Allows combining elements from two separate datasets into a 
single track where the elements are denoted as case (target) or 
control, depending on their source. This allows analyses of how 
other tracks preferentially interact with case elements as opposed 
to control elements. 



Allows combining elements from multiple datasets into a single 
track, denoted with a category that reflects their source. 



Supports a rich set of possibilities for constructing tracks based on 
the DNA sequence itself along a reference genome. 

Allows the generation of tracks giving for each bp the distance (in 
bps) to the nearest element in any track. 

Generates so-called 'intensity tracks' which are used in controlling 
for confounder tracks in particular analyses. The user selects a 
target track as well as a set of control tracks, i.e. a set of tracks 
whose influence on the target track one aims to control for. The 
generated intensity track defines, for each base pair, the prob- 
ability that an element of the target track lands at that position 
during randomization. The intensity track can afterwards be 
selected as part of the null model specification when doing hy- 
pothesis testing through the 'Analyze genomic tracks' tool. 

Generates a global track of occurrence locations for a specified 
k-mer on a particular reference genome. 

Generates a track of gene segments along the human genome, 
where the genes are associated with one or more specified litera- 
ture terms. The associations are provided by the CoreMine 
medical database, which is regularly updated with term-gene 
associations mined from published literature. 

The most commonly used formats for genomic location data are 
(arguably) the formats BED, BedGraph and WIG defined by the 
UCSC Genome Browser, as well as the format GFF in various 
versions. The tool allows converting between these formats, to 
the degree they are able to represent the same information. The 
tool also allows converting data to and from the recent GTrack 
format, which is a recent, unified format that is capable of repre- 
senting data of any track type, and thus data stemming from 
any of the other file formats (6). 

The tool allows structuring unformatted tabular data into a 
GTrack file by specifying the necessary meta-data through 
simple selection boxes, inferring further properties of the 
data where possible. 



Extract the RefSeq gene track, in order 
to expand the gene segments with 
the 'Expand BED segments' tool. 



An example of an analysis involving 
both proximity and relative position- 
ing is the analysis of histone modifi- 
cation frequencies in bins of 
particular distances relative to the 
upstream end points of genes 
(transcription start sites). 



An example is to combine chromatin 
states from two different cell types as 
case and control elements, in order 
to ask whether regions associated to 
MS susceptibility overlap more with 
case than control segments. See 
section 'Full analysis scenario'. 

Merge segment tracks denoting, e.g., 
exons, introns and intergenic regions 
in order to create a category track 
spanning the whole genome. 

Construct a bp-level track of GC 
content in a sliding window of select- 
able size along the genome. 

Generate a bp-level track of distance to 
nearest gene. 

Can, e.g., be used to control for the in- 
fluence of gene proximity when 
analyzing the relation between TF 
binding locations and active regions 
in a given cell type. 



Generate a track of all occurrences of 
the 8-mer 'ACGTTGCA' in the 
human hg19 genome assembly. 

Find a set of genes associated with 
melanoma. Each gene will have an 
attached P-value, denoting the 
strength of the association. 



Convert a GTrack file to the BED 
format in order to use BED-specific 
Galaxy tools. 



Import virus integration sites of the 
Human Papilloma Virus (HPV) from 
an Excel spreadsheet into a GTrack 
file for further analysis by the 
'Analyze genomic tracks' tool. 



Further descriptions are given at the web pages of the tools themselves, along with demo buttons and links to reproducible examples of how each 
tool can be used. The GTrack-related tools have previously been described (6). 
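
As an illustration of the kind of track generation listed in the table, the k-mer example can be sketched in a few lines of Python. This is a minimal sketch and not the HyperBrowser implementation: the function names, the toy sequence and the choice of 0-based, half-open coordinates (as in BED) are assumptions made for the example, and only the forward strand is scanned.

```python
def kmer_occurrences(sequence, kmer):
    """Return 0-based, half-open (start, end) segments for every occurrence
    of `kmer` on the forward strand of `sequence`, overlapping ones included."""
    if not kmer:
        raise ValueError("kmer must be a non-empty string")
    sequence = sequence.upper()
    kmer = kmer.upper()
    k = len(kmer)
    segments = []
    start = sequence.find(kmer)
    while start != -1:
        segments.append((start, start + k))
        # Advance by a single base so that overlapping occurrences are kept.
        start = sequence.find(kmer, start + 1)
    return segments


def to_bed_lines(chrom, segments):
    """Format segments as tab-separated BED-style lines (hypothetical helper)."""
    return ["%s\t%d\t%d" % (chrom, start, end) for start, end in segments]


# Toy sequence (an assumption for the example): two adjacent copies of the 8-mer.
toy_sequence = "TT" + "ACGTTGCA" + "ACGTTGCA" + "TT"
toy_segments = kmer_occurrences(toy_sequence, "ACGTTGCA")
toy_bed = to_bed_lines("chrToy", toy_segments)
```

A genome-wide track would be obtained by applying the same scan to each chromosome of the chosen assembly in turn.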

SUPPLEMENTING GUI SELECTION WITH 
COMMAND-BASED BATCH EXECUTION 

A web interface based primarily on point-and-click selec- 
tion has several advantages compared to a command-line- 
based approach to data analysis. A main advantage is that 
it does not require the recollection of suitable commands 
and parameters to achieve a given analysis objective. 
A typical disadvantage is that it may be cumbersome 
to perform a multitude of similar analyses. This is in 
contrast to the command-based approach, where slight 
modifications to an analysis can often be done very 
quickly, and where looping may allow multiple analyses 
to be performed without a huge manual effort. We believe 
this is rapidly becoming an important issue for gen- 
ome analysis, as e.g. the ENCODE and Roadmap 
Epigenomics projects generate chromatin and transcrip- 
tion factor binding tracks for hundreds of different cell 
types. 

To meet this challenge, we have combined advantages 
of both worlds, the point-and-click based and the 
command based, through what we refer to as 'batch 
execution functionality'. For the initial specification of 
an analysis, we mainly rely on a GUI-based approach, 
using selection boxes as described in the section 
'Analysis of genomic tracks'. After an analysis has been 
specified through the GUI, one can click on 'Inspect par- 
ameters of the analysis' to obtain a 'corresponding batch 
command line'. This purely textual representation of the 
analysis can now be modified and/or duplicated according 
to customized needs, and executed in the 'Execute batch 
commands' tool under the menu 'Text-based analysis 
interface'. Two options increase the flexibility of the 
format. A slash (/) denotes that an analysis is to be 
performed with multiple alternative tracks or parameter 
values, and a star character (*) denotes that a given 
analysis is to be performed on all sub-tracks at a given 
level of the HyperBrowser track collection hierarchy. 
These extensions of the format greatly simplify the 
process of running a given analysis on a set of related 
tracks, e.g., for different chromatin marks or cell lines. 
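
The effect of the slash and star extensions can be sketched as a simple expansion of one line into many. The following Python sketch is only an illustration of the idea: the field separator '|', the level separator ':', the track names and the nested-dictionary model of the track hierarchy are all assumptions made for the example, and do not reproduce the actual batch command syntax of the HyperBrowser.

```python
from itertools import product


def expand_star(track, hierarchy, level_sep=":"):
    """Replace a trailing '*' by every sub-track directly below its parent.

    `hierarchy` is a nested dict standing in for the track collection
    (a hypothetical model, used here only for illustration).
    """
    parts = track.split(level_sep)
    if parts[-1] != "*":
        return [track]
    node = hierarchy
    for name in parts[:-1]:
        node = node[name]
    return [level_sep.join(parts[:-1] + [child]) for child in sorted(node)]


def expand_batch_line(line, hierarchy, field_sep="|", alt_sep="/"):
    """Expand one command line into all concrete combinations.

    Each field may list alternatives separated by '/', and each
    alternative may end in '*' to select all sub-tracks at that level.
    """
    choices_per_field = []
    for field in line.split(field_sep):
        choices = []
        for alternative in field.split(alt_sep):
            choices.extend(expand_star(alternative, hierarchy))
        choices_per_field.append(choices)
    return [field_sep.join(combo) for combo in product(*choices_per_field)]


# Hypothetical track hierarchy and command line, for illustration only.
toy_hierarchy = {"Chromatin": {"H3K4me3": {}, "H3K27ac": {}}}
toy_line = "hg19|Genes/Exons|Chromatin:*|Overlap"
expanded = expand_batch_line(toy_line, toy_hierarchy)
```

In this toy case a single line stands for four analyses (two alternative first tracks times two chromatin sub-tracks), which is the kind of saving the batch format is meant to provide.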

FULL ANALYSIS SCENARIO 

The full reach of the Genomic HyperBrowser system 
becomes apparent when considering the combination of 
various tools for processing and analyzing data. By em- 
ploying an appropriate combination of data preparation 
and analysis functionality, a range of sophisticated and 
precisely specified hypotheses can be investigated. 

An example of such an analysis is the investigation of 
whether regions associated with a given disease overlap 
preferentially with marks of active chromatin in a 
certain cell type compared to another reference cell type. 
A sequence of steps for analyzing multiple sclerosis (MS) 
associated regions in B-cells versus hepatocytes is given in 
a Galaxy Page at http://bit.ly/hb_example. This page 
shows the sequence of tools that has been used, along 
with the exact input parameters and resulting outputs 
for each of the tools. Any step can be easily reproduced 
exactly or with modifications to the input parameters. 



The analysis starts with a set of SNP coordinates in a 
form that reflects a typical starting point: data in a raw 
text file or a spreadsheet document. The SNP data are 
uploaded and formatted, and two genomic tracks of active 
chromatin state regions (12) in B-cells and hepatocytes 
are extracted from the HyperBrowser track repository. 
In their original track representations, the question 
of interest would be whether the track of active regions 
in B-cells shows a stronger presence in the vicinity of 
SNP positions than the hepatocyte track, after appropri- 
ate normalization based on overall differences between 
the tracks of active regions. Both the concept of vicinity 
and the need for normalization complicate the precise 
formulation of an appropriate question. By expanding 
the SNPs to include flanks, and by combining the two 
tracks of active regions into a single case-control 
track, the final question becomes whether the MS SNP 
proximity regions overlap preferentially with segments 
of the combined active chromatin state track marked 
as case versus control. As can be seen from the result 
output of the final step of the analysis, this is indeed the 
case (13). 
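
The data preparation behind this reformulated question can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the analysis performed by the HyperBrowser: it works on a single chromosome, uses 0-based half-open segments, assumes that segments within each category do not overlap one another, and computes only the descriptive overlap (in bps) per category. The hypothesis test itself is not reproduced, and all function names and toy coordinates are hypothetical.

```python
def expand_with_flanks(positions, flank):
    """Turn point positions into merged segments [pos - flank, pos + flank + 1)."""
    segments = sorted((max(0, p - flank), p + flank + 1) for p in positions)
    merged = []
    for start, end in segments:
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent flank regions are merged so that
            # no base pair is counted twice.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def combine_case_control(case_segments, control_segments):
    """Build one track whose segments are labelled by their source."""
    labelled = [(s, e, "case") for s, e in case_segments]
    labelled += [(s, e, "control") for s, e in control_segments]
    return sorted(labelled)


def overlap_by_category(query_segments, labelled_segments):
    """Sum the overlap (in bps) between the query and each category."""
    totals = {"case": 0, "control": 0}
    for q_start, q_end in query_segments:
        for start, end, category in labelled_segments:
            totals[category] += max(0, min(q_end, end) - max(q_start, start))
    return totals


# Toy coordinates (assumptions for illustration only).
snp_positions = [100, 105, 500]
proximity = expand_with_flanks(snp_positions, flank=10)
combined = combine_case_control(
    case_segments=[(80, 110), (495, 505)],      # e.g. active in B-cells
    control_segments=[(112, 130), (600, 700)],  # e.g. active in hepatocytes
)
totals = overlap_by_category(proximity, combined)
```

Comparing the two totals alone would ignore the overall difference in coverage between the case and control segments, which is why the full analysis evaluates the preference for case over control through a statistical test rather than through raw counts.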

The Genomic HyperBrowser complements, and is integrated 
with, other systems for working with genomic track data, 
both conceptually and implementation-wise. A 
powerful way to work with genomic data may be to, e.g., 
first get some general impressions and ideas about the data 
through direct visualization and browsing in the UCSC 
genome browser (3), followed by genome-scale explor- 
ation using EpiExplorer (14). Relevant hypotheses may 
then be evaluated by robust statistical analysis within 
the Genomic HyperBrowser. Throughout such an 
analysis scenario, one may also use a variety of Galaxy 
tools that work well together with all the mentioned 
systems. 

CONCLUSIONS 

The Genomic HyperBrowser is a comprehensive system 
for statistical analysis of genomic tracks. A range of 
genomic investigations can be addressed through a com- 
bination of data processing and analysis tools. Novel 
features and analyses are continually added to the 
system. Furthermore, if a user faces a track analysis chal- 
lenge that cannot be resolved through the present version 
of the system, we take it upon ourselves to react promptly 
and expand the system. 